Abstract

Multi-agent learning in Markov Decision Problems is
challenging because of the presence of two credit assignment
problems: 1) how to credit an action taken at time step t for
rewards received at t' > t; and 2) how to credit an action
taken by agent i considering that the system reward is a
function of the actions of all the agents. The first credit
assignment problem is typically addressed with temporal
difference methods such as Q-learning or TD(λ). The second
credit assignment problem is typically addressed either by
hand-crafting reward functions that assign proper credit to an
agent, or by making certain independence assumptions about an
agent’s state space and reward function. To address both credit
assignment problems simultaneously, we propose “Q Updates with
Immediate Counterfactual Rewards-learning” (QUICR-learning),
designed to improve both the convergence properties and the
performance of Q-learning in large multi-agent problems.
Instead of assuming that an agent’s value function can be made
independent of other agents, this method suppresses the impact
of other agents using counterfactual rewards. Results on
multi-agent grid-world problems over multiple topologies show
that QUICR-learning can achieve up to thirty-fold improvements
in performance over both conventional and local Q-learning in
the largest tested systems.

1 Introduction 

A critical issue in the multi-agent reinforcement learning pro- 
cess is the structural credit assignment problem: how to re- 
ward a single agent’s action choices when the reward we in- 
tend to maximize is a function of all of the agents’ actions. 
As an example, consider how to reward construction rovers
building a dome. This reward can be computed by measuring the
performance of the rovers’ actions over a series of episodes.
For example, in a single-rover scenario, suppose the rover took
action sequence $a_1$ for 100 episodes in simulation and the
dome collapsed 50 times. Then the agent took action sequence
$a_2$ for 100 episodes and the dome collapsed 30 times. Based
on this evidence, we can say with high confidence that action
sequence $a_2$ is better than action sequence $a_1$. Now
consider the situation where a large team of rovers is used to
build the dome. How does one evaluate the actions of a single
rover in such a case? Suppose rover i took action sequence
$a_{i,1}$ for 100 episodes and the dome collapsed 50 times. In
addition, rover i took action sequence $a_{i,2}$ for 100
different episodes and the dome collapsed 30 times. Can we
claim that action sequence $a_{i,2}$ was better than action
sequence $a_{i,1}$? Not with the same confidence with which we
claimed that action sequence $a_2$ was better than action
sequence $a_1$ in the single-rover case. Furthermore, our
confidence will drop even further as the number of rovers in
the system grows. This is because in a task with n homogeneous
agents, on average, the choice of action sequence by agent i
($a_{i,1}$ or $a_{i,2}$) will likely have an impact of 1/n on
the system performance. As a consequence, teasing out the
impact of $a_{i,1}$ will require far more iterations, as one
needs to ascertain that $a_{i,1}$ was indeed a poorer choice
than $a_{i,2}$, and not merely taken during episodes where the
collective actions of the other agents were poor.

The difficulty here arises from evaluating an individual
agent’s actions using a function that, a priori, all the agents
impact equally. In such a case, a standard Q-learner is not
equipped to handle the “structural” credit assignment problem
of how to apportion the system credit to each individual agent.
As an alternative, we present “Q Updates with Immediate
Counterfactual Rewards learning” (QUICR-learning), which uses
agent-specific rewards that suppress the impact of other
agents. Rewards in QUICR-learning are both heavily
agent-sensitive, making the learning task easier, and aligned
with the system-level goal, ensuring that agents receiving high
rewards are helping the system as a whole. These agent-specific
reward functions are then used with standard temporal
difference methods to create a learning method that is
significantly faster than standard Q-learning in large
multi-agent systems.

Currently the best multi-agent learning algorithms address the
structural credit assignment problem by leveraging domain
knowledge. In robotic soccer, for example, player-specific
subtasks are used, followed by tiling to provide good
convergence properties [Stone et al., 2005]. In a robot
coordination problem for the foraging domain, specific rules
inducing good division of labor are created [Jones and Mataric,
2003]. In domains where groups of agents can be assumed to be
independent, the task can be decomposed by learning a set of
basis functions used to represent the value function, where
each basis function processes only a small number of the state
variables [Guestrin et al., 2002]. Multi-agent Partially
Observable Markov Decision Processes (POMDPs) can also be
simplified through piecewise linear rewards [Nair et al.,
2003]. There have also been several approaches to optimizing
Q-learning in multi-agent systems that do not use independence
assumptions. For a small number of agents, game-theoretic
techniques were shown to lead to a multi-agent Q-learning
algorithm proven to converge [Hu and Wellman, 1998]. In
addition, the equivalence between structural and temporal
credit assignment was shown in [Agogino and Tumer, 2004], and
Bayesian methods were shown to improve multi-agent learning by
providing better exploration capabilities [Chalkiadakis and
Boutilier, 2003]. In large systems, game theory has addressed
credit assignment in congestion problems with Vickrey tolls
[Vickrey, 1961].

In this paper, we present QUICR-learning, which provides fast
convergence in multi-agent learning domains without assuming
that the full system reward is linearly separable and without
requiring hand tuning based on domain knowledge. In Section 2
we discuss the temporal and structural credit assignment
problems in multi-agent systems, and describe the
QUICR-learning algorithm. In Section 3 we present results on
two variants of a multi-agent gridworld problem, showing that
QUICR-learning performs up to thirty times better than standard
Q-learning in multi-agent problems.

2 Credit Assignment Problem 

The multi-agent temporal credit assignment problem consists of
determining how to assign rewards (i.e., credit) for a sequence
of actions. Starting from the current time step t, the
undiscounted sum of rewards until a final time step T can be
represented by:

$$R_t(s_t(a)) = \sum_{k=0}^{T-t} r_{t+k}(s_{t+k}(a)) , \qquad (1)$$

where a is a vector containing the actions of all agents at all
time steps, $s_t(a)$ is the state function returning the state
of all agents for a single time step, and $r_t(s)$ is the
single-time-step reward function, which is a function of the
states of all of the agents.
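As a minimal sketch of Equation 1, the undiscounted return from time step t is simply the sum of the remaining single-step rewards of the episode. The function name `undiscounted_return` and the reward values below are illustrative assumptions, not part of the original experiments.

```python
from typing import Sequence


def undiscounted_return(rewards: Sequence[float], t: int) -> float:
    """Return R_t = r_t + r_{t+1} + ... + r_T for a trace indexed 0..T."""
    if not 0 <= t < len(rewards):
        raise IndexError("t must index a time step of the episode")
    return float(sum(rewards[t:]))


# Hypothetical four-step episode (T = 3); the values are made up.
episode_rewards = [0.0, 0.5, 0.0, 1.0]
return_from_start = undiscounted_return(episode_rewards, 0)
return_from_last = undiscounted_return(episode_rewards, 3)
```

In the multi-agent setting each entry of `episode_rewards` would itself depend on the states of all agents, which is what makes the credit assignment hard.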

This reward is a function of all of the previous actions of all
of the agents. Every reward is a function of the states of all
the agents, and every state is a function of all the actions
that preceded it (even though the system is Markovian, the
previous states ultimately depend on previous actions). In a
system with n agents, on average $\frac{n}{2} T$ actions affect
a reward. Agents need to use this reward to evaluate their
single action, yet even in the idealized domains presented in
Section 3, with one hundred agents and thirty-two time steps,
there are an average of 1600 actions affecting the reward!
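The figure of 1600 quoted above is consistent with counting n actions per time step and averaging over the position of the reward within the episode, so that roughly half of the n·T actions precede a typical reward. The helper below is a small arithmetic sketch under that assumed reading; its name is hypothetical.

```python
def actions_affecting_reward(n_agents: int, horizon: int) -> tuple:
    """Return (average, maximum) number of actions that can affect a reward.

    Assumes n_agents actions per time step, all n * T actions preceding the
    final reward, and about half of them preceding an average reward.
    """
    maximum = n_agents * horizon
    average = maximum / 2
    return average, maximum


average_count, maximum_count = actions_affecting_reward(100, 32)
```

For one hundred agents and thirty-two time steps this gives an average of 1600 and a maximum of 3200 actions.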

2.1 Standard Q-Learning 

Reinforcement learners such as Q-learning address how to assign
credit for future rewards to an agent’s current action. The
goal of Q-learning is to create a policy that maximizes the sum
of future rewards, $R_t(s_t(a))$, from the current state
[Kaelbling et al., 1996; Sutton and Barto, 1998; Watkins and
Dayan, 1992]. It does this by maintaining tables of Q-values,
which estimate the expected sum of future rewards for a
particular action in a particular state. In the TD(0) version
of Q-learning, a Q-value, $Q(s_t, a_t)$, is updated with the
following Q-learning rule¹:

$$\Delta Q(s_t, a_t) = \alpha \left( r_t + \max_a Q(s_{t+1}, a) \right) . \qquad (2)$$

The assumption with this update is that the action $a_t$ is
most responsible for the immediate reward $r_t$, but is
somewhat less responsible for the sum of future rewards,
$\sum_{k=1}^{T-t} r_{t+k}(s_{t+k}(a))$. This assumption is
reasonable since rewards in the future are affected by
uncertain future actions and noise in state transitions.
Instead of using the sum of future rewards directly to update
its table, Q-learning uses a Q-value from the next state
entered as an estimate for those future rewards. Under benign
assumptions, Q-values are shown to converge to the actual value
of the future rewards [Watkins and Dayan, 1992].
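The update can be sketched as a tabular, undiscounted TD(0) step. This sketch uses the standard textbook form of the rule [Watkins and Dayan, 1992], in which the current estimate $Q(s_t, a_t)$ is subtracted inside the parentheses so that the table moves toward the target rather than growing without bound. The names `q_update`, `Q` and `ACTIONS` and the state labels are assumptions made for illustration.

```python
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")


def q_update(Q, state, action, reward, next_state, alpha=0.5):
    """One undiscounted tabular Q-learning step; returns the new Q-value."""
    # Estimate of the future rewards: best Q-value in the next state.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    # Standard TD error: target minus the current estimate.
    td_error = reward + best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return Q[(state, action)]


Q = defaultdict(float)  # tables start at zero
first = q_update(Q, "s0", "up", 1.0, "s1")
second = q_update(Q, "s0", "up", 1.0, "s1")
```

With a learning rate of 0.5 and a constant reward of 1.0, the estimate moves halfway toward the target on each step (0.5, then 0.75).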

Even though Q-learning addresses the temporal credit assignment
problem (i.e., properly apportions the effects of all actions
taken at other time steps to the current reward), the immediate
reward in a multi-agent system still suffers from the
structural credit assignment problem (i.e., the reward is still
a function of all the agents’ actions). Standard Q-learning
does not address this structural credit assignment problem, and
an agent will get full credit for actions taken by all of the
other agents. As a result, when there are many agents, standard
Q-learning is generally slow, since it will take many episodes
for an agent to figure out its impact on a reward it barely
influences.

2.2 Local Q-Learning 

One way to address the structural credit assignment problem and
allow for fast learning is to assume that agents’ actions are
independent. Without this assumption, the immediate reward
function for a multi-agent reward system may be a function of
all the states:

$$r_t(s_{t,1}(a_1), s_{t,2}(a_2), \ldots, s_{t,n}(a_n)) ,$$

where $s_{t,i}(a_i)$ is the state for agent i and is a function
of only agent i’s previous actions. The number of states
determining the reward grows linearly with the number of
agents, while the number of actions that determine each state
grows linearly with the number of time steps. To reduce the
huge number of actions that affect this reward, the reward is
often assumed to be linearly separable:

$$r_t(s_t) = \sum_i w_i \, r_{t,i}(s_{t,i}(a_i)) .$$

Then each agent receives a reward $r_{t,i}$ which is only a
function of its own actions. Q-learning is then used to resolve
the remaining temporal credit assignment problem. If the agents
are actually independent, this method leads to a significant
speedup in learning, as an agent receives direct credit for its
actions. If the agents are coupled, then the independence
assumption still allows fast learning, but the agents will tend
to converge to the wrong policy. With loose coupling, the
benefits of the assumption may still outweigh the costs when
there are many agents. However, when agents are tightly
coupled, the independence assumption may lead to unacceptable
solutions, and the agents may even converge to a solution that
is worse than random [Wolpert and Tumer, 2001].

¹ This paper uses undiscounted learning to simplify notation,
but all the algorithms also apply to discounted learning.

2.3 QUICR-Learning 

In this section we present QUICR-learning, a learning algorithm
for multi-agent systems that does not assume that the system
reward function is linearly separable. Instead it uses a
mechanism for creating rewards that are a function of all of
the agents, but still provide many of the benefits of
hand-crafted rewards. Hand-crafted algorithms exploit detailed
knowledge about a domain to provide agent rewards that allow
the system to maximize a global reward. These rewards are
designed to have two beneficial properties: they are “aligned”
with the overall learning task and they have high “sensitivity”
to the actions of the agent.

The first property, alignment, means that when an agent
maximizes its own reward it tends to maximize the overall
system reward. Without this property, a large multi-agent
system can lead to agents performing useless work or, worse,
working at cross-purposes. The second property, reward
sensitivity, means that an agent’s reward is more sensitive to
its own actions than to other agents’ actions. This property is
important for agents to learn quickly.

QUICR-learning is based on providing agents with rewards that
are both aligned with the system goals and sensitive to the
agent’s states. It aims to provide the benefits of hand-crafted
algorithms without requiring detailed domain knowledge. In a
task where the reward can be expressed as in Equation 1, let us
introduce the difference reward (adapted from [Wolpert and
Tumer, 2001]) given by:

$$D_t^i(s_t(a)) = R_t(s_t(a)) - R_t(s_t(a - a_{t,i})) ,$$

where $a - a_{t,i}$ denotes a counterfactual state where agent
i has not taken the action it took in time step t (i.e., the
action of agent i has been removed from the vector containing
the actions of all the agents before the system state has been
computed). Decomposing further, we obtain:

$$D_t^i(s_t(a)) = \sum_{k=0}^{T-t} \left[ r_{t+k}(s_{t+k}(a)) - r_{t+k}(s_{t+k}(a - a_{t,i})) \right]$$

$$= \sum_{k=0}^{T-t} d_{t+k}\left(s_{t+k}(a),\, s_{t+k}(a - a_{t,i})\right) , \qquad (3)$$

where $d_t(s_1, s_2) = r_t(s_1) - r_t(s_2)$. (We introduce the
single-time-step “difference” reward $d_t$ to keep the parallel
between Equations 1 and 3.) This reward is much more sensitive
to an agent’s action than $r_t$, since much of the effect of
the other agents is subtracted out with the counterfactual
[Wolpert and Tumer, 2001]. Unfortunately, in general
$d_t(s_1, s_2)$ is non-Markovian, since the second parameter
may depend on previous states, making its use troublesome in a
learning task involving both temporal and structural credit
assignment.

In order to overcome this shortcoming of Equation 3, let us
make the following two assumptions:

1. The counterfactual action $a - a_{t,i}$ moves agent i to an
absorbing state, $s_b$, that is independent of its current
state.

2. The future states of agents other than agent i are not
affected by the actions of agent i.

The first assumption forces us to compute a counterfactual
state that is not necessarily a minor modification to agent i’s
current state. Therefore, differential function estimation
techniques that rely on a small change in agent i’s state
(e.g., Taylor series expansion) cannot be used. However, each
agent’s counterfactual state is computed only for itself (i.e.,
not for other agents) and for a single time step (i.e., the
counterfactual states do not propagate through time). The
second assumption holds in many multi-agent systems since, to
reduce the state space to manageable levels, agents often do
not directly observe each other (though they are still coupled
through the reward).

Given these conditions, the counterfactual state for time
t + k is computed from the actual state at time t + k, by
replacing the state of agent i at time t with $s_b$. Now the
difference reward can be made into a Markovian function:

$$d_t^i(s_t) = r_t(s_t) - r_t(s_t - s_{t,i} + s_b) , \qquad (4)$$

where the expression $s_t - s_{t,i} + s_b$ denotes replacing
agent i’s state with state $s_b$.
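A minimal sketch of Equation 4, assuming the joint state is a tuple with one entry per agent and that `None` stands in for the absorbing state $s_b$. The toy reward `coverage` and all function names are assumptions chosen for illustration, not the reward used in the experiments.

```python
from typing import Callable, Hashable, Optional, Tuple

AgentState = Optional[Hashable]
JointState = Tuple[AgentState, ...]

ABSORBING: AgentState = None  # stand-in for s_b (an assumption of this sketch)


def replace_with_absorbing(joint_state: JointState, agent: int) -> JointState:
    """Build s_t - s_{t,i} + s_b: agent i's state is swapped for s_b."""
    return joint_state[:agent] + (ABSORBING,) + joint_state[agent + 1:]


def difference_reward(reward_fn: Callable[[JointState], float],
                      joint_state: JointState, agent: int) -> float:
    """d_t^i(s_t) = r_t(s_t) - r_t(s_t - s_{t,i} + s_b)."""
    counterfactual = replace_with_absorbing(joint_state, agent)
    return reward_fn(joint_state) - reward_fn(counterfactual)


def coverage(joint_state: JointState) -> float:
    """Toy coupled reward: number of distinct cells held by active agents."""
    return float(len({s for s in joint_state if s is not ABSORBING}))


joint = ("A", "A", "B")  # agents 0 and 1 share cell A, agent 2 holds cell B
redundant_agent = difference_reward(coverage, joint, 0)
unique_agent = difference_reward(coverage, joint, 2)
```

Agent 0 duplicates agent 1 and so receives a difference reward of zero, while agent 2 is credited with the one cell that only it covers. Only the current joint state is needed, which is what makes the reward Markovian.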

Now the Q-learning rule can be applied to the difference
reward, resulting in the QUICR-learning rule:

$$\Delta Q(s_t, a_t) = \alpha \left( r_t(s_t) - r_t(s_t - s_{t,i} + s_b) + \max_a Q(s_{t+1}, a) \right)$$

$$= \alpha \left( d_t^i(s_t) + \max_a Q(s_{t+1}, a) \right) . \qquad (5)$$

Note that since this learning rule is Q-learning, albeit
applied to a different reward structure, it shares all the
convergence properties of Q-learning. In order to show that
Equation 5 leads to good system-level behavior, we need to show
that agent i maximizing $d_t^i(s_t)$ (i.e., following
Equation 5) will maximize the system reward $r_t$. Note that by
definition $(s_t - s_{t,i} + s_b)$ is independent of the
actions of agent i, since it is formed by moving agent i to the
absorbing state $s_b$, from which it cannot emerge. This
effectively means the partial derivative of $d_t^i(s_t)$ with
respect to agent i is²:



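$$\begin{aligned}
\frac{\partial}{\partial s_i} d_t^i(s_t)
  &= \frac{\partial}{\partial s_i} \left( r_t(s_t) - r_t(s_t - s_{t,i} + s_b) \right) \\
  &= \frac{\partial}{\partial s_i} r_t(s_t) - \frac{\partial}{\partial s_i} r_t(s_t - s_{t,i} + s_b) \\
  &= \frac{\partial}{\partial s_i} r_t(s_t) - 0 \\
  &= \frac{\partial}{\partial s_i} r_t(s_t) . \qquad (6)
\end{aligned}$$

² Though in this work we show this result for differentiable
states, the principle applies to more general states, including
discrete states.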


Therefore any agent i using a learning algorithm to optimize
$d_t^i(s_t)$ will also optimize $r_t(s_t)$. Furthermore, note
that QUICR-learning not only converges to a globally desirable
solution (i.e., it satisfies the first property of being
aligned with the system-level goal), but also converges faster,
since the rewards are more sensitive to the actions of agent i
because much of the effect of the other agents is removed
through the counterfactual subtraction.
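The alignment argument can be checked numerically in a discrete setting: because the counterfactual term does not depend on agent i, any change of agent i's state changes $d_t^i$ by exactly as much as it changes $r_t$. The sketch below enumerates a tiny three-cell, three-agent system. The reward function, the cell values and all names are hypothetical and chosen only so that the reward is coupled (not linearly separable).

```python
import itertools

ABSORBING = None  # stand-in for the absorbing state s_b
CELL_VALUES = {0: 1.0, 1: 0.6, 2: 0.2}  # made-up values


def system_reward(joint_state):
    """Coupled toy reward: each distinct occupied cell pays its value once."""
    occupied = {s for s in joint_state if s is not ABSORBING}
    return sum(CELL_VALUES[c] for c in occupied)


def difference_reward(joint_state, agent):
    """System reward minus the reward with the agent moved to s_b."""
    counterfactual = joint_state[:agent] + (ABSORBING,) + joint_state[agent + 1:]
    return system_reward(joint_state) - system_reward(counterfactual)


def alignment_holds(n_agents=3, agent=0):
    """Check that moving `agent` changes d and r by the same amount."""
    cells = tuple(CELL_VALUES)
    for joint in itertools.product(cells, repeat=n_agents):
        for new_cell in cells:
            moved = joint[:agent] + (new_cell,) + joint[agent + 1:]
            gain_r = system_reward(moved) - system_reward(joint)
            gain_d = difference_reward(moved, agent) - difference_reward(joint, agent)
            if abs(gain_r - gain_d) > 1e-12:
                return False
    return True


aligned_for_first = alignment_holds(agent=0)
aligned_for_last = alignment_holds(agent=2)
```

Both checks pass for this toy reward, so in it a move that raises an agent's difference reward raises the system reward by the same amount.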

3 Multi-agent Grid World Experiments 

We performed a series of simulations to test the performance of
Q-learning, local Q-learning and QUICR-learning for multi-agent
systems. We selected the multi-agent Grid World Problem, a
variant of the standard Grid World Problem [Sutton and Barto,
1998]. In this problem, at each time step, the agent can move
up, down, right or left one grid square, and receives a reward
(possibly zero) after each move. The observable state space for
the agent is its grid coordinate, and the reward it receives
depends on the grid square to which it moves. In the episodic
version, which is the focus of this paper, the agent moves for
a fixed number of time steps, and then is returned to its
starting location.

In this paper we compare learning algorithms in a multi-agent
version of the grid world problem. In this instance of the
problem there are multiple agents navigating the grid
simultaneously, influencing each other’s rewards. In this
problem agents are rewarded for observing tokens located in the
grid. Each token has a value between zero and one, and each
grid square can have at most one token. When an agent moves
into a grid square it observes a token and receives a reward
for the value of the token. Rewards are only received on the
first observation of the token. Future observations from the
agent or other agents do not receive rewards in the same
episode. If two agents move into the same square at the same
time. More precisely, $r_t$ is computed by summing the agents
at the same location as unobserved tokens, weighted by the
value of the tokens:

$$r_t(s_t) = \sum_i \sum_j V_j \, I_{s_{t,i} = L_j} , \qquad (7)$$

where I is the indicator function, which returns one when an
agent in state $s_{t,i}$ is at the location $L_j$ of an
unobserved token, and $V_j$ is the value of that token. The
global objective of the multi-agent Grid World Problem is to
observe the highest aggregated value of tokens in a fixed
number of time steps T.

3.1 Learning Algorithms 

In each algorithm below, we use the TD(0) update rule. Standard
Q-learning is based on the full reward $r_t$:

$$\Delta Q(s_t, a_t) = \alpha \left( r_t(s_t) + \max_a Q(s_{t+1}, a) \right) . \qquad (8)$$

Local Q-learning uses a reward that is only a function of the
specific agent’s own state:

$$\Delta Q_{loc}(s_t, a_t) = \alpha \left( \sum_j V_j \, I_{s_{t,i} = L_j} + \max_a Q_{loc}(s_{t+1}, a) \right) .$$

QUICR-learning instead updates with a reward that is a function
of all of the states, but uses counterfactuals to suppress the
effect of other agents’ actions:

$$\Delta Q_{QUICR}(s_t, a_t) = \alpha \left( r_t(s_t) - r_t(s_t - s_{t,i} + s_b) + \max_a Q_{QUICR}(s_{t+1}, a) \right) ,$$

where $s_t - s_{t,i} + s_b$ is the state resulting from
removing agent i’s state and replacing it with the absorbing
state $s_b$.

Figure 1: Distribution of Token Values in “Corner” World
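To make the contrast between the three learners concrete, the sketch below computes the immediate reward each one would feed into its TD(0) update for the same joint position. It assumes that a token pays out only once when several agents reach it in the same time step; the source sentence describing simultaneous arrivals is truncated, so this tie rule is an assumption of the sketch, as are the token values and all function names.

```python
from typing import Dict, Optional, Tuple

Cell = Tuple[int, int]
ABSORBING: Optional[Cell] = None  # stand-in for the absorbing state s_b


def global_reward(positions, tokens: Dict[Cell, float]) -> float:
    """Full reward r_t: value of unobserved tokens reached this step.

    Assumption: a token reached by several agents at once is counted once.
    """
    reached = {p for p in positions if p is not ABSORBING and p in tokens}
    return sum(tokens[p] for p in reached)


def local_reward(positions, tokens: Dict[Cell, float], agent: int) -> float:
    """Local reward: token value at the agent's own cell, ignoring others."""
    return tokens.get(positions[agent], 0.0)


def quicr_reward(positions, tokens: Dict[Cell, float], agent: int) -> float:
    """Difference reward: r_t(s_t) - r_t(s_t - s_{t,i} + s_b)."""
    counterfactual = list(positions)
    counterfactual[agent] = ABSORBING
    return global_reward(positions, tokens) - global_reward(counterfactual, tokens)


tokens = {(0, 0): 0.9, (1, 1): 0.4}    # made-up unobserved tokens
positions = [(0, 0), (0, 0), (1, 1)]   # agents 0 and 1 collide on the 0.9 token

full = global_reward(positions, tokens)
local_0 = local_reward(positions, tokens, 0)
quicr_0 = quicr_reward(positions, tokens, 0)
quicr_2 = quicr_reward(positions, tokens, 2)
```

Under this assumption the colliding agent 0 sees 0.9 from its local reward but zero from the difference reward, since the token would have been observed without it, whereas agent 2 is credited with its own 0.4 token. This mirrors the congestion effect discussed in the results.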

3.2 Results 

To evaluate the effectiveness of QUICR-learning in the
multi-agent Grid World, we conducted experiments on two
different types of token distributions. The first set of tokens
is designed to force congestion and tests the ability of
QUICR-learning in domains where the reward function is far from
being linearly separable. The second set is randomly generated
from Gaussian kernels, to illustrate the capabilities of
QUICR-learning in a non-hand-crafted domain with spread-out
tokens (a domain favoring less dependent, local learners).

In all the experiments the learning rate was set to 0.5, the
actions were chosen using an ε-greedy (ε = 0.15) exploration
scheme, and tables were initially set to zero with ties broken
randomly.
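The exploration scheme described above can be sketched as follows: with probability ε a uniformly random action is taken, otherwise a greedy action is chosen with ties among equally valued actions broken at random. The function name and the dictionary representation of a table row are assumptions of this sketch.

```python
import random


def epsilon_greedy(q_values, epsilon=0.15, rng=random):
    """Pick an action from a {action: Q-value} mapping.

    Explores with probability epsilon; otherwise picks uniformly among
    the actions that share the maximal Q-value (random tie-breaking).
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q_values.values())
    return rng.choice([a for a in actions if q_values[a] == best])


rng = random.Random(0)  # seeded generator so runs are repeatable
zero_row = dict.fromkeys(("up", "down", "left", "right"), 0.0)
picks = {epsilon_greedy(zero_row, rng=rng) for _ in range(200)}
greedy_pick = epsilon_greedy({"up": 0.0, "down": 1.0}, epsilon=0.0, rng=rng)
```

Random tie-breaking matters here because the tables start at zero, so without it every agent would initially prefer the same action.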

3.3 Corner World Token Value Distribution 

The first experimental domain we investigated consisted of a
world where the “highly valued” tokens are concentrated in one
corner, with a second concentration near the center where the
rovers are initially located. Figure 1 conceptualizes this
distribution for a 20x20 world.

Figure 2a shows the performance for 40 agents on a 400
unit-square world for the token value distribution shown in
Figure 1, where an episode consists of 20 time steps (error
bars of ± one σ are included, though in most cases they are
smaller than the symbols). The performance measure in these
figures is the sum of full rewards ($r_t(s_t)$) received in an
episode, normalized so that the maximum reward achievable is
1.0. Note that all learning methods are evaluated on the same
reward function, independent of the reward function that they
use internally to assign credit to the agents.

The results show that local Q-learning generally produced poor
results. This problem is caused by all agents aiming to acquire
the most valuable tokens, and congregating towards the corner
of the world where such tokens are located. In essence, in this
case agents using local Q-learning competed rather than
cooperated. The agents using standard Q-learning did not fare
better, as the agents were plagued by the credit assignment
problem associated with each agent receiving the full world
reward for each individual action they took. Notice, though,
that while local Q-learning agents hit a performance plateau
early, the standard Q-learning agents continue to learn, albeit
slowly. This is because standard Q-learning agents are aligned
with the system reward, but have low agent sensitivity. Agents
using QUICR-learning, on the other hand, learned rapidly,
outperforming both local and standard Q-learning by a factor of
six (over random rovers).

Figure 2: Learning Rates in Corner World with 40 Agents

Figure 3: Scaling Properties of Different Payoff Functions.

Figure 3 explores the scaling properties for each algorithm. As
the number of agents was increased, the difficulty of the
problem was kept constant by increasing the size of the
gridworld and allocating more time for an episode.
Specifically, the ratio of the number of agents to the total
number of grid squares and the ratio of the number of agents to
the total value of tokens were held constant. In addition, the
ratio of the distance of the furthest grid square from the
agents’ starting point to the total amount of time in an
episode was also held constant (e.g., 40 agents, 20x20 grid, 20
time steps; 100 agents, 32x32 grid, 32 time steps). The scaling
results show that the performance of agents using both local
and standard Q-learning deteriorates rapidly as the number of
agents increases. Agents using QUICR-learning, on the other
hand, were not strongly affected by the increase in the size of
the problem, and outperformed local and standard Q-learners by
a factor of thirty for the largest system. This is because
QUICR-learning agents received rewards that were both aligned
with the system goal and had high agent sensitivity (i.e., were
less affected by the size of the system). This result
underscores the need for using rewards that suppress the effect
of other agents’ actions in large systems.

Figure 4: Distribution of Token Values in “Random” World

Figure 5: Learning Rates in Random World with 40 Agents

3.4 Random World Token Value Distribution 

In the second set of experiments, we investigate the behav- 
ior of agents in a grid world where the token values are ran- 
domly distributed. In this world, for n agents, there are n/3 
Gaussian ‘attractors’ whose centers are randomly distributed. 
Figure 4 shows an instance of the gridworld using this distri- 
bution for the 20x20 world, used in the experiments with 40 
agents. 

The results in Figures 5 and 6 show that agents using
QUICR-learning are insensitive to changes in the token value
distribution. Agents using local Q-learning perform
significantly better in this case, showing a much larger
sensitivity to the token distribution. The improvements are due
to the spreading of tokens over a larger area, which allows
agents aiming (and failing) to collect high-valued tokens to
still collect the mid- to low-valued tokens surrounding the
high-valued tokens. In this domain, agents using standard
Q-learning performed particularly poorly. Indeed, they were
outperformed by agents performing random walks for systems of
40 agents. Due to the distributed tokens and the large number
of agents, agents using standard Q-learning were never able to
form an effective policy. These agents essentially moved from
their initial random walk to a slightly more coordinated random
walk, causing them to spread out less than agents performing
independent random walks, and thus to perform slightly worse
than random agents.

The scaling results show again that QUICR-learning performs
well as the number of agents increases. The interesting result
here is that QUICR-learning does better with 40 to 55 agents
than with 10 agents (this is not a statistical quirk, but a
repeated phenomenon in many different random token
configurations). One potential cause is that the larger number
of agents explores the space more efficiently without being
beset by the problems encountered by the other learning
algorithms. The scaling results confirm that standard
Q-learners perform slightly coordinated random walks in this
setting, performing ever so slightly worse than random in all
cases with more than 25 agents. With this token distribution
the rewards used in standard Q-learning appear to be no better
than random rewards, even with ten agents. While local
Q-learning performs better on this token distribution than on
the previous one, it still scales less gracefully than does
QUICR-learning.

4 Discussion 

Using Q-learning to learn a control policy for a single agent
in a problem with many agents is difficult, because an agent
will often have little influence over the reward it is trying
to maximize. In our example problems, an agent’s reward
received after an action could be influenced by as many as 3200
other actions from other time steps and other agents. Even
temporal difference methods that perform very well in
single-agent systems will be overwhelmed by the number of
actions influencing a reward in the multi-agent setting. To
address this problem, this paper introduced QUICR-learning,
which aims at reducing the impact of other agents’ actions
without assuming linearly separable reward functions. Within
the Q-learning framework, QUICR-learning uses the difference
reward computed with immediate counterfactuals. While
eliminating much of the influence of other agents, this reward
was shown mathematically to be aligned with the global reward:
agents maximizing the difference reward will also be maximizing
the global reward. Experimental results in two Grid World
problems confirm the analysis, showing that QUICR-learning
learns in less time than standard Q-learning, and achieves
better results than Q-learning variants that use local rewards
and assume linear separability. While this method was used with
TD(0) Q-learning updates, it also naturally extends to TD(λ),
Sarsa-learning and Monte Carlo estimation. In domains with
difficult temporal credit assignment issues, the use of these
other variants could be beneficial.